class: center, middle, inverse, title-slide

.title[
# Quality control
]
.author[
### James Ashmore • 22-Oct-2022
]
.institute[
### Zifo RnD Solutions
]

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous">

<!--------------- Only edit title, subtitle & author above this --------------->

```r
knitr::opts_chunk$set(echo = FALSE, fig.align = "center")
```

---

## What is quality control?

* From Wikipedia, the free encyclopedia:

> Quality control, commonly shortened to QC, refers to all those processes and procedures designed to ensure that the results of laboratory analysis are consistent, comparable, accurate and within specified limits of precision.

* In bioinformatics, quality control is performed at several stages:
    1. Quality of the raw data
    2. Effect of processing on the data
    3. Correctness of the code and software
    4. Interpretation of the results
* For sequencing data, we often start by looking at the quality of the raw data
* There are two types of quality metrics we can measure:
    * Generated by the sequencing platform (e.g. quality scores)
    * Calculated directly from the raw reads (e.g. base composition)

---

## Why is quality control important?

* In bioinformatics, quality control is necessary to avoid drawing the **wrong** conclusions
* There are many stages at which an error could be introduced:
    * The **code** you wrote is wrong
    * The **software** you used had a bug
    * The **data** you analysed was contaminated
* Errors in code and software are subtle and require due care and attention
* Problems with data are often less subtle and software can detect issues like contamination
* What happens if we **don't** take care in our analysis?
<br>

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/quality/xkcd_data_error.png" alt="xkcd: Data Error" width="65%" />
<p class="caption">xkcd: Data Error</p>
</div>

---

## Retraction watch

*A tale of caution...*

.pull-left-50[

* The authors sequenced an ancient African genome from an individual they called "Mota"
* They compared "Mota" to other human genomes from across the globe (SNP analysis)
* Someone **forgot** to run a conversion script between one piece of software and the next
* Data was inadvertently lost, which made "Mota" appear less European than he actually was
* We don't penalize scientists for owning up to **mistakes**
* It is more important that we correct the scientific record!

]

.pull-right-50[

<img src="data:image/png;base64,#data/quality/bioinformatics-error-1.jpg" width="32%" style="display: block; margin: auto;" />

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/quality/bioinformatics-error-2.webp" alt="Mota Cave" width="32%" />
<p class="caption">Mota Cave</p>
</div>

]

---

## Sequencing reads

*A quick refresher on Illumina sequencing...*
---

## Sequencing reads

*A quick refresher on FASTQ files...*

.pull-left-55[

* An extension of FASTA format to handle base quality metrics from sequencing machines
* Each sequence is represented by 4 lines:
    1. Identifier of the sequence starting with `@`
    2. Actual sequence
    3. Separator line starting with `+`, optionally followed by the identifier
    4. Quality scores for each base
* FASTQ files are usually gzip-compressed and given the extension `.fastq.gz`
* For each sample per flow cell lane:
    * Single-end: `.fastq.gz`
    * Paired-end: `_1.fastq.gz` `_2.fastq.gz`

]

.pull-right-45[

```fastq
@HWUSI-EAS100R:6:73:941:1973#0/1
TGAAGNCTATAAACTAAGAAGCAAGCACACTAGGAGTT
+
AAAAA#EEEEA/EEEEEEE6EAEAEEEEEEEEEEEEEE
@HWUSI-EAS100R:6:73:942:1973#0/1
GTCACNATTCTCAAGGCCGTCGTCTTTTTAGTCGGTTT
+
AAAAA#EEEAEEAEEEEEEEAAAEEEEEE6E/E<E/AE
@HWUSI-EAS100R:6:73:943:1973#0/1
GTCGCTAAGCTCTATATAGCGCGCGGGGGTATAGCTCG
+
AAAAA#EEEAEEAEEEEEEEEEEEEEEEE6E/E<EEEE
```

]

---

## Quality control software

* There are multiple tools available to quality check sequencing reads
* In our *opinion* none of them is perfect, but there is a clear favorite: [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
* Here's what Simon Andrews, the developer, has to say:

> FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

* FastQC can be run as a GUI application or on the command line - *we recommend the latter!*
* You will learn to use FastQC in the workshop afterwards!
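---

## Quality control software

*A quick command-line sanity check...*

Before running a dedicated tool, the four-line FASTQ layout from the refresher can be inspected by hand. The following is a minimal sketch, not part of FastQC: the helper names `count_reads` and `phred_score` are hypothetical, the input is assumed to be an uncompressed file with strictly four lines per record, and quality characters are assumed to use the Phred+33 encoding.

```bash
#!/usr/bin/env bash
# Sketch only: assumes uncompressed, strictly 4-line FASTQ records
# and Phred+33 quality encoding (an assumption, not stated in the slides).

# Count reads: every record occupies exactly four lines.
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# Convert one quality character to a Phred score (ASCII code minus 33).
phred_score() {
  local ascii
  ascii=$(printf '%d' "'$1")   # a leading quote makes printf emit the character code
  echo $(( ascii - 33 ))
}

# Demo file holding the first record shown on the FASTQ slide.
demo=$(mktemp)
printf '%s\n' \
  '@HWUSI-EAS100R:6:73:941:1973#0/1' \
  'TGAAGNCTATAAACTAAGAAGCAAGCACACTAGGAGTT' \
  '+' \
  'AAAAA#EEEEA/EEEEEEE6EAEAEEEEEEEEEEEEEE' > "$demo"

count_reads "$demo"   # prints 1
phred_score 'A'       # prints 32
phred_score '#'       # prints 2 (the quality assigned to the N base call)
```

For compressed files, the same helper could be fed from `gzip -dc sample.fastq.gz` via a temporary file or a small change to read standard input.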
---

## FastQC

<img src="data:image/png;base64,#data/quality/fastqc-preview.png" width="50%" style="display: block; margin: auto;" />

.pull-left-50[

**So how does FastQC work?**

* A series of analysis modules is run on the sequencing data (a few modules only examine a subset of the reads)
* The module results are collected into an HTML report for inspection by the user
* Each module measures some aspect of the data which might reveal problems with the sequencing

]

.pull-right-50[

* A summary evaluation for each module is displayed:
    * Normal
    * Slightly abnormal
    * Very unusual
* Do not take these evaluations too seriously - *they are there as a rough guide only!*

]

---

## FastQC

### Basic Statistics

<br>

<img src="data:image/png;base64,#data/quality/basic_statistics.png" width="90%" style="display: block; margin: auto;" />

<p style="text-align: center;"><i>Sorry for the low-quality image!</i></p>

---

## FastQC

### Per base sequence quality

<img src="data:image/png;base64,#data/quality/per_base_quality.png" width="80%" style="display: block; margin: auto;" />

---

## FastQC

### Per tile sequence quality

<img src="data:image/png;base64,#data/quality/per_tile_quality.png" width="80%" style="display: block; margin: auto;" />

---

## FastQC

### Per sequence quality scores

<img src="data:image/png;base64,#data/quality/per_sequence_quality.png" width="80%" style="display: block; margin: auto;" />

---

## FastQC

### Per base sequence content

<img src="data:image/png;base64,#data/quality/per_base_sequence_content.png" width="80%" style="display: block; margin: auto;" />

---

## FastQC

### Per sequence GC content

<img src="data:image/png;base64,#data/quality/per_sequence_gc_content.png" width="80%" style="display: block; margin: auto;" />

---

## FastQC

### Per base N content

<img src="data:image/png;base64,#data/quality/per_base_n_content.png" width="80%" style="display: block; margin: auto;" />

---

## FastQC

### Sequence length distribution

<img
src="data:image/png;base64,#data/quality/sequence_length_distribution.png" width="80%" style="display: block; margin: auto;" />

---

## FastQC

### Sequence duplication levels

<img src="data:image/png;base64,#data/quality/duplication_levels.png" width="80%" style="display: block; margin: auto;" />

---

## FastQC

### Overrepresented sequences

<br>

<img src="data:image/png;base64,#data/quality/overrepresented_sequences.png" width="95%" style="display: block; margin: auto;" />

<p style="text-align: center;"><i>Again, sorry for the low-quality image!</i></p>

---

## FastQC

### Adapter content

<img src="data:image/png;base64,#data/quality/adapter_content.png" width="80%" style="display: block; margin: auto;" />

---

## FastQC

*So what does 'good' data look like?*

<iframe src="data/quality/fastqc-good.html" width="100%" height="500" data-external="1"></iframe>

---

## FastQC

*And what does 'bad' data look like?*

<iframe src="data/quality/fastqc-bad.html" width="100%" height="500" data-external="1"></iframe>

---

## FastQC

*A demonstration by the one and only Simon Andrews...*
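---

## FastQC

*A rough sketch of what a module measures...*

To make the idea of a module more concrete, the sketch below computes the GC percentage of each read, the raw quantity summarised by the "Per sequence GC content" plot. This is an illustration only and not FastQC's implementation: the helper name `gc_per_read` and the demo reads are hypothetical, and the input is assumed to be an uncompressed FASTQ file with strictly four lines per record.

```bash
#!/usr/bin/env bash
# Sketch only: NOT FastQC's implementation. Assumes an uncompressed FASTQ
# file with strictly four lines per record.

# Print the GC percentage of each read, one value per line.
gc_per_read() {
  awk 'NR % 4 == 2 && length($0) > 0 {
         seq = $0
         gc = gsub(/[GCgc]/, "", seq)   # gsub returns the number of replacements made
         printf "%.1f\n", 100 * gc / length($0)
       }' "$1"
}

# Hypothetical two-read demo file (sequences invented for illustration).
demo_gc=$(mktemp)
printf '%s\n' \
  '@read1' 'GGCCAATT' '+' 'EEEEEEEE' \
  '@read2' 'GCGCGCGC' '+' 'EEEEEEEE' > "$demo_gc"

gc_per_read "$demo_gc"   # prints 50.0 then 100.0
```

A histogram of these values across all reads gives the shape that the module compares against an expected distribution.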
---

## Sequencing adapter trimming

.pull-left-50[

**Why do adapter sequences appear in sequencing reads?**

* Sequencing usually starts at the 5' end just after the primer
* The number of cycles dictates the number of bases sequenced
* Adapter read-through happens when the length of the DNA insert is less than the number of cycles
* No adapter read-through:
    * Insert length: 250
    * Sequencing cycles: 200
    * Adapter read-through: 200 - 250 = -50 (negative: no read-through)
* Adapter read-through:
    * Insert length: 100
    * Sequencing cycles: 200
    * Adapter read-through: 200 - 100 = 100 (100 bases of read-through)

]

.pull-right-50[

<br>

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#data/quality/adapter-origin.png" alt="Credit: ecSeq" width="90%" />
<p class="caption">Credit: ecSeq</p>
</div>

]

---

## Sequencing adapter trimming

.pull-left-50[

**Why do we need to trim adapters from sequencing reads?**

* The presence of **artificial** sequences can interfere with read alignment:
    * Reads may be **misaligned** due to artificial sequence similarity
    * Reads may be **unaligned** due to reduced sequence similarity
* Trimming is essential for certain types of genomics data:
    * Genotyping
    * Genome assembly
    * Small RNA
* Trimming may not have a huge effect on other types of data, but why take the chance?
]

.pull-right-40[

<br>

<img src="data:image/png;base64,#data/quality/why-adapter-trimming.jpeg" width="100%" style="display: block; margin: auto;" />

]

---

## Adapter trimming software

* There are multiple tools available to remove adapter sequences
* In our *opinion* there is not much difference between them
* They all have similar functionality and their performance is comparable
* Some of the most popular tools include:
    * [Cutadapt](https://cutadapt.readthedocs.io/en/stable/)
    * [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)
    * [fastp](https://github.com/OpenGene/fastp)
* Good documentation is important, and for that reason we prefer **Cutadapt**
* You will learn to use Cutadapt in the workshop afterwards!

---

## Adapter trimming software

* A quick look at the Cutadapt documentation:

<div class="figure" style="text-align: center">
<iframe src="https://cutadapt.readthedocs.io/en/stable/" width="90%" height="475" data-external="1"></iframe>
<p class="caption">https://cutadapt.readthedocs.io/en/stable/</p>
</div>

---

## Adapter trimming software

* Cutadapt has a lot of convenient, easy-to-understand features:
    * **Standard input and output**
        * Stream data without saving to disk
        * Save on storage and processing time
    * **Multi-core support**
        * Speed up processing of large files
        * Trimming becomes less of a bottleneck in the pipeline
    * **Multiple adapter types**
        * Remove 5' and 3' adapter sequences
        * Accommodates complex library preparation
    * **Quality trimming**
        * Remove low-quality base calls
        * Potentially improve alignment rate
    * **Read filtering**
        * Minimum read length
        * Discard both reads of a pair

---

## Adapter trimming software

* Cutadapt can detect multiple adapter types:

<img src="data:image/png;base64,#data/quality/cutadapt-adapter-types.jpeg" width="85%" style="display: block; margin: auto;" />

---

## Adapter trimming software

* A quick look at the Illumina adapter sequences document:

<div class="figure" style="text-align: center">
<iframe
src="https://support-docs.illumina.com/SHARE/AdapterSeq/illumina-adapter-sequences.pdf" width="90%" height="475" data-external="1"></iframe>
<p class="caption">https://support-docs.illumina.com/SHARE/adapter-sequences.htm</p>
</div>

---

## Adapter trimming software

* Cutadapt performs adaptive quality trimming:

<img src="data:image/png;base64,#data/quality/cutadapt-trimming-algorithm.jpeg" width="80%" style="display: block; margin: auto;" />

---

## Adapter trimming software

* Trim a single 3' adapter sequence:

```bash
cutadapt -a ADAPTER -o output.fastq input.fastq
```

* Trim multiple 3' adapter sequences:

```bash
cutadapt -a ADAPTER1 -a ADAPTER2 -o output.fastq input.fastq
```

* Trim low-quality bases from the 3' end:

```bash
cutadapt -q QUALITY -o output.fastq input.fastq
```

* Trim paired-end reads:

```bash
cutadapt -a ADAPTER1 -A ADAPTER2 -o out.1.fastq -p out.2.fastq in.1.fastq in.2.fastq
```

* Discard reads shorter than LENGTH:

```bash
cutadapt -m LENGTH -o output.fastq input.fastq
```

* Trim poly-A tails:

```bash
cutadapt -a "A{100}" -o output.fastq input.fastq
```

---

## Summary

* Quality control is a necessary step in all bioinformatics analyses
* It is there to prevent us drawing the "wrong" conclusions from the data
* For sequencing data, we start by looking at the quality of the reads
* FastQC is one such program you can use to quality check sequencing reads
* FastQC modules should be evaluated in the context of the experimental design
* Adapter contamination can be removed using read trimming software
* Cutadapt is one such program you can use, and the alternatives perform comparably
* *Do not become paralysed by quality control! Know when to say good enough!*

<!------------------------ Do not edit this and below ------------------------->

---
name: end_slide
class: end-slide, middle
count: false

# Thank you. Questions?
.end-text[ <p class="smaller"> <span class="small" style="line-height: 1.2;">Graphics from </span><img src="./assets/freepik.jpg" style="max-height:20px; vertical-align:middle;"><br> Created: 22-Oct-2022 • James Ashmore • <a href="https://www.zifornd.com/category/omics-bioinformatics">Bioinformatics</a> • <a href="https://www.zifornd.com">Zifo</a> </p> ]